Skip to content

Conversation

@ravatex
Copy link
Contributor

@ravatex ravatex commented Oct 31, 2025

Refs #13604

This PR includes all of the logic components of the RefCheck functionality.

With this we can implement refcheck in the CLI and GUI of jabref.

The added class, RefChecker, searches through different Fetchers to find matching or similar entries and returns either Fake, Real, or Unsure with related found entries.

To achieve this a new method in the DuplicateCheck class has been made, that finds the similarity of different Entries and returns a double between 0 and 1 depending on how similar the entries are.

Steps to test

Read the Unit Tests and run them to see how the classes are meant to work.

Mandatory checks

  • I own the copyright of the code submitted and I license it under the MIT license
  • I manually tested my changes in running JabRef (always required)
  • I added JUnit tests for changes (if applicable)
  • [/] I added screenshots in the PR description (if change is visible to the user)
  • [/] I described the change in CHANGELOG.md in a way that is understandable for the average user (if change is visible to the user)
  • [/] I checked the user documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request updating file(s) in https://github.com/JabRef/user-documentation/tree/main/en.

@ravatex
Copy link
Contributor Author

ravatex commented Nov 1, 2025

In my implementation of DuplicateCheck, should I ignore any non-standard fields?

Would that be preferable?

Copy link
Member

@subhramit subhramit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, good work so far. I am leaving some comments.

* @return number [0,1] 1 representing the same (one potentially having more fields), 0 representing completely different
*/
public double degreeOfSimilarity(final BibEntry one, final BibEntry two) {
return one.getFields((f) -> two.getField(f).isPresent())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using abbreviations

.stream().mapToDouble((field) -> {
String first = one.getField(field).get();
String second = two.getField(field).get();
return new StringSimilarity().similarity(first, second);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't create a new StringSimilarity instance for each field

Comment on lines 31 to 33
public RefChecker(
DoiFetcher doiFetcher,
ArXivFetcher arXivFetcher) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only two parameters, so can be in a single line. Use one in each line when they are >=4

public RefChecker(
DoiFetcher doiFetcher,
ArXivFetcher arXivFetcher) {
this(doiFetcher, arXivFetcher, new CrossRef(), new DuplicateCheck(new BibEntryTypesManager()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to inject bibentrytypesmanager via constructor. If you are unable to, use the injector (Injector.instantiateModelOrService(BibEntryTypesManager.class))

}

Optional<BibEntry> other = fetcher.performSearchById(doi.get().asString());
return other.map(o -> compareReferences(entry, o))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid abbreviations

Comment on lines +235 to +237
public static final class Fake extends ReferenceValidity {
public boolean equals(Object o) {
return o.getClass() == Fake.class;
Copy link
Member

@subhramit subhramit Nov 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note: Don't take this suggestion right away, let another maintainer confirm)
Creating a new Fake every time seems a bit odd as unlike Real, it is a stateless marker. This should be a singleton.
Maybe do something like

public static final Fake INSTANCE = new Fake();
    
private Fake() {}
    
@Override
public boolean equals(Object o) {
    return o instanceof Fake;
}

and for the hashcode return an arbitrary constant.
This way, you can just do something like return Fake.INSTANCE instead of return new Fake().

@calixtus do you have a better idea? Something feels odd about the design.

Comment on lines 208 to 210
Set<BibEntry> matchingReferences = new HashSet<>();
matchingReferences.add(matchingReference);
this.matchingReferences = matchingReferences;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Set<BibEntry> matchingReferences = new HashSet<>();
matchingReferences.add(matchingReference);
this.matchingReferences = matchingReferences;
this.matchingReferences = new HashSet<>(Set.of(matchingReference));

return this;
}
if (other instanceof Unsure otherUnsure && this instanceof Unsure thisUnsure) {
otherUnsure.addAll(thisUnsure);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An or method, in principle, should return a new result and not mutate its inputs

@Test
void validateListOfEntriesTest() throws FetcherException {
List<BibEntry> entries = List.of(realEntry, realEntryNoDoi, fakeEntry);
var e = refChecker.validateListOfEntries(entries);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove abbreviations and var usages from this file.
(usually we don't fret so much about tests but in case we ever need to debug them, readability always helps)

@Test
void closeToRealEntry() throws FetcherException {
RefChecker.ReferenceValidity rv = refChecker.referenceValidityOfEntry(closeToRealEntry);
System.out.println(((RefChecker.Unsure) rv).matchingReferences);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

@subhramit subhramit requested a review from koppor November 1, 2025 19:28
@ravatex
Copy link
Contributor Author

ravatex commented Nov 2, 2025

Should I ignore internal fields for degreeOfSimilarity, they are not fields that directly relate to the reference/bib so maybe they should be ignored? That makes sense at least from the RefChecker standpoint, but I could just make another method, that ignores internal fields and another that doesn't.

@koppor
Copy link
Member

koppor commented Nov 3, 2025

This is a difficult question.

Use case 1: Papers of a PDF and a fetcher: These fields probably won't appear.

Use case 2: Paper by fetcher is added to library: JabRef fields can be ignored.

Use case 3: Paper of library A is copied to library B: As user, I want the two entries being identified as similar even if "my" grouping, attached files, etc are different.

Result: Yes, can be ignored. Has to be documented at JavaDoc.

@calixtus
Copy link
Member

calixtus commented Nov 3, 2025

They can be ignored, a fetcher will never return a special field value, they are JabRef specific.

Comment on lines +616 to +620
@Test
void degreeOfSimilarityOfSameEntryIsOne() {
assertEquals(1.0, duplicateChecker.degreeOfSimilarity(getSimpleArticle(), getSimpleArticle()));
assertEquals(1.0, duplicateChecker.degreeOfSimilarity(getSimpleInCollection(), getSimpleInCollection()));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please only a single assertion per method - see https://devdocs.jabref.org/code-howtos/testing.html

I think, you can rewrite using @ParamterizedTest.

I think, this can also be merged entriesWithNoMatchingFieldHaveNoSimilarity and more.


public static abstract sealed class ReferenceValidity permits Real, Unsure, Fake {

public ReferenceValidity or(ReferenceValidity other) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add JavaDoc (///) for this.

Comment on lines +183 to +185
assertEquals(t1, t1.or(t2));
assertEquals(t1, t1.or(t3));
assertEquals(t2, t3.or(t2));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this... t1 is always equal to t1. Why is t1.or(t2) wirtten down there? 😅

}

@Test
void findsRealEntry() throws FetcherException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, all tests can be rewritten to @ParamterizedTest

Comment on lines +94 to +100
StandardField.PDF,
StandardField.PS,
StandardField.URL,
StandardField.DOI,
StandardField.FILE,
StandardField.ISBN,
StandardField.ISSN)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we need all of them - PDF and PS are outdated.

* If the result is one, it means that there was at least one common field
* and all the common fields were the same.
* <p>
* Similar entries have a score of above 0.8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.8 should be constant in DuplicateCheck / linked from there.

Comment on lines +29 to +32
DoiFetcher doiFetcher;
ArXivFetcher arxivFetcher;
CrossRef crossRef;
DuplicateCheck duplicateCheck;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be private?

Comment on lines +49 to +59
/**
* Tries to find the best reference validity
* among current ways. If any of the methods signal
* that it is real, it returns early.
* <p>
* DoiFetcher -> CrossRef -> ArxivFetcher
*
* @param entry entry checking
* @return the reference validity
* @throws FetcherException any error from fetchers
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to use Markdown as JavaDoc format. Otherwise you need to escape >.

Comment on lines +254 to +256
public int hashCode() {
return Objects.hashCode(Fake.class);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No neeed for an implementation, JDK's default works well in most cases

Override is only necessary if equals is overridden: https://thefinestartist.com/effective-java/09

ReferenceValidity get() throws FetcherException;
}

public static abstract sealed class ReferenceValidity permits Real, Unsure, Fake {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

record(Validity validtiy, @Nullable BibTeX otherEntry)

And Valildity being an enum - and otherEntry allowed for null in the fake entry.

I am not sure if there could be a group of other entries in some cases; but this design would be more future proof and more readable. @subhramit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants